pacman::p_load(jsonlite, tidygraph, ggraph,
               visNetwork, graphlayouts, ggforce,
               skimr, tidytext, tidyverse, patchwork, ggiraph)
Take Home_Ex03
1. Background
FishEye International, a non-profit focused on countering illegal, unreported, and unregulated (IUU) fishing, has been given access to an international finance corporation’s database on fishing related companies. In the past, FishEye has determined that companies with anomalous structures are far more likely to be involved in IUU (or other “fishy” business). FishEye has transformed the database into a knowledge graph. It includes information about companies, owners, workers, and financial status. FishEye is aiming to use this graph to identify anomalies that could indicate a company is involved in IUU.
With reference to Mini-Challenge 3 of VAST Challenge 2023 and by using appropriate static and interactive statistical graphics methods, we will be helping FishEye to better understand fishing business anomalies.
2. Data Source
The data is taken from the Mini-Challenge 3 of VAST Challenge 2023.
3. Data Preparation
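3.1 Installing and launching R packages
p_load() of the pacman package checks whether the listed packages are installed on the computer, installs any that are missing, and loads them into R. The R packages loaded are jsonlite, tidygraph, ggraph, visNetwork, graphlayouts, ggforce, skimr, tidytext, tidyverse, patchwork and ggiraph. DT is called later with the DT:: prefix, so it must be installed as well.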
3.2 Loading the Data
fromJSON() of the jsonlite package is used to import MC3.json into the R environment.
mc3_data <- fromJSON("data/MC3.json")
The output is called mc3_data. It is a large list R object.
3.3 Extracting edges
The code chunk below will be used to extract the links data.frame of mc3_data and save it as a tibble data.frame called mc3_edges.
mc3_edges <- as_tibble(mc3_data$links) %>%
distinct() %>%
mutate(source = as.character(source),
target = as.character(target),
type = as.character(type)) %>%
group_by(source, target, type) %>%
summarise(weights = n()) %>%
filter(source!=target) %>%
  ungroup()
3.4 Extracting nodes
The code chunk below will be used to extract the nodes data.frame of mc3_data and save it as a tibble data.frame called mc3_nodes.
mc3_nodes <- as_tibble(mc3_data$nodes) %>%
mutate(country = as.character(country),
id = as.character(id),
product_services = as.character(product_services),
revenue_omu = as.numeric(as.character(revenue_omu)),
type = as.character(type)) %>%
  select(id, country, type, revenue_omu, product_services) # select() is used to reorder the columns
3.4 Initial Data Exploration
3.4.1 Exploring the edges data frame
In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_edges tibble data frame.
skim(mc3_edges)
| Name | mc3_edges |
| Number of rows | 24036 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 6 | 700 | 0 | 12856 | 0 |
| target | 0 | 1 | 6 | 28 | 0 | 21265 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weights | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
The report above reveals that there are no missing values in any field.
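The weights column in the summary above is constant at 1 (mean 1, sd 0). A plausible explanation is that distinct() removes duplicate rows before they are counted, so n() can only return 1 for each source, target and type combination. The sketch below illustrates this on a small made-up edge table; toy_edges and its values are assumptions for illustration, not rows from MC3.json.

```r
library(dplyr)

# Hypothetical edge list: the first link appears twice.
toy_edges <- tibble(
  source = c("Co A", "Co A", "Co B"),
  target = c("Ann",  "Ann",  "Bob"),
  type   = c("Beneficial Owner", "Beneficial Owner", "Company Contacts")
)

# distinct() first: duplicates are removed before counting,
# so every weight ends up as 1.
weights_after_distinct <- toy_edges %>%
  distinct() %>%
  count(source, target, type, name = "weights")

# Counting without distinct(): the repeated link keeps its multiplicity.
weights_raw <- toy_edges %>%
  count(source, target, type, name = "weights")
```

If edge multiplicity matters for the analysis, dropping the distinct() step would preserve it. Whether MC3.json actually contains repeated links is not shown here.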
In the code chunk below, datatable() of DT package is used to display mc3_edges tibble data frame as an interactive table on the html document.
DT::datatable(mc3_edges)
Now, we will plot the distribution of the types of relationship that exist between the source and target, together with their frequencies.
#| echo: false
#| fig-width: 3
#| fig-height: 3
hist_type <- ggplot(data = mc3_edges,
aes(x = type)) +
geom_bar() +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.1) + # after_stat() replaces the deprecated ..count.. notation
labs(title = "Distribution of Relationship Types", x = "Type", y = "Count") +
theme(plot.title = element_text(face = "bold"))
# hist_type
There are two types of relationship, Beneficial Owner and Company Contacts, with 16,792 edges for the former and 7,244 for the latter.
Next, we explore how many companies an owner usually owns. Owners who own noticeably more companies than the norm may be flagged as suspicious, and we can focus further investigation on them.
To begin, we keep only the edges with type == "Beneficial Owner" and count the companies per owner, as shown in the code chunk below.
#| echo: false
#| fig-width: 3
#| fig-height: 3
mc3_edges_owner <- mc3_edges %>%
filter(type == "Beneficial Owner") %>%
group_by(target, type) %>%
summarise(no_of_companies = n()) %>%
ungroup()
mc3_edges_owner
# A tibble: 15,305 × 3
target type no_of_companies
<chr> <chr> <int>
1 Aaron Adams Beneficial Owner 1
2 Aaron Adkins Beneficial Owner 1
3 Aaron Allen Beneficial Owner 1
4 Aaron Alvarez Beneficial Owner 1
5 Aaron Baker Beneficial Owner 1
6 Aaron Beasley Beneficial Owner 1
7 Aaron Berry Beneficial Owner 1
8 Aaron Black Beneficial Owner 1
9 Aaron Boyle Beneficial Owner 1
10 Aaron Carroll Beneficial Owner 1
# ℹ 15,295 more rows
We can also plot the distribution of the number of companies each beneficial owner owns using ggplot2.
#| echo: false
#| fig-width: 3
#| fig-height: 3
# Create a ggplot histogram
gg_hist_own <- ggplot(mc3_edges_owner, aes(x = no_of_companies)) +
  geom_histogram(binwidth = 1) + # one bar per integer count, so the bars line up with the labels added below
labs(title = "No of companies beneficial owners own", x = "No of companies", y = "Count") +
theme(plot.title = element_text(face = "bold")) +
scale_x_continuous(breaks = seq(min(mc3_edges_owner$no_of_companies), max(mc3_edges_owner$no_of_companies), by = 1))
# Calculate frequency counts for each bin
freq_counts <- table(mc3_edges_owner$no_of_companies)
# Create a data frame for labels
label_data <- data.frame(x = as.numeric(names(freq_counts)), y = as.numeric(freq_counts))
# Add frequency labels to the plot
gg_hist_own <- gg_hist_own +
geom_text(
data = label_data,
aes(x = x, y = y, label = y),
vjust = -0.5,
size = 3
)
# Display the ggplot histogram
# gg_hist_own
We can combine the plot of relationship types with the plot of companies per beneficial owner using patchwork.
#| echo: false
#| fig-width: 3
#| fig-height: 3
combined_plot <- hist_type / gg_hist_own
combined_plot
As we can see above, only a small percentage (<0.5%) of beneficial owners own more than 3 companies. These owners will be flagged as suspicious, and we will investigate them further.
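The flagging rule can be sketched as a filter on the per-owner counts plus a share calculation. The example below is a minimal sketch on made-up data shaped like mc3_edges_owner; toy_owner, threshold, flagged and share_flagged are hypothetical names introduced only for illustration.

```r
library(dplyr)

# Hypothetical per-owner counts, shaped like mc3_edges_owner.
toy_owner <- tibble(
  target          = c("Ann", "Bob", "Cy", "Dee", "Eve"),
  no_of_companies = c(1L, 1L, 2L, 4L, 9L)
)

threshold <- 3  # same cut-off as used in the text

# Owners above the cut-off, largest first.
flagged <- toy_owner %>%
  filter(no_of_companies > threshold) %>%
  arrange(desc(no_of_companies))

# Percentage of owners that exceed the cut-off.
share_flagged <- 100 * mean(toy_owner$no_of_companies > threshold)
```

Applied to mc3_edges_owner, the same two steps would return the list of flagged owners and the percentage quoted above.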
Next, we create a new data frame for the edges called mc3_edges_with_no_of_companies, which adds the no_of_companies column.
#| echo: false
#| fig-width: 3
#| fig-height: 3
# Join the no_of_companies column from mc3_edges_owner into mc3_edges
mc3_edges_with_no_of_companies <- mc3_edges %>%
left_join(mc3_edges_owner %>% select(target, no_of_companies),
by = c("target" = "target")) %>%
mutate(no_of_companies = ifelse(is.na(no_of_companies), 0, no_of_companies))
# View the updated mc3_edges
mc3_edges_with_no_of_companies
# A tibble: 24,036 × 5
source target type weights no_of_companies
<chr> <chr> <chr> <int> <dbl>
1 1 AS Marine sanctuary Christina Taylor Compa… 1 1
2 1 AS Marine sanctuary Debbie Sanders Benef… 1 1
3 1 Ltd. Liability Co Cargo Angela Smith Benef… 1 1
4 1 S.A. de C.V. Catherine Cox Compa… 1 0
5 1 and Sagl Forwading Angela Mendoza Compa… 1 0
6 1 and Sagl Forwading Christopher Watson Benef… 1 1
7 2 Limited Liability Company Amanda Mcdonald Benef… 1 1
8 2 Limited Liability Company Megan Padilla Compa… 1 0
9 2 Limited Liability Company Monica Martinez Compa… 1 0
10 2 Limited Liability Company Teresa Collins Benef… 1 1
# ℹ 24,026 more rows
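Note that the join above is keyed on target alone, so a Company Contacts edge receives the ownership count of any beneficial owner with the same name. Row 1 of the output (a company contact with no_of_companies = 1) appears to be such a case. The sketch below reproduces the behaviour on made-up data; toy_edges, toy_owner and toy_joined are hypothetical names, and coalesce() is used here in place of ifelse() as one possible way to fill the missing counts.

```r
library(dplyr)

# Hypothetical edges: "Ann" is an owner of Co A and a contact of Co B.
toy_edges <- tibble(
  source = c("Co A", "Co B", "Co C"),
  target = c("Ann", "Ann", "Bob"),
  type   = c("Beneficial Owner", "Company Contacts", "Company Contacts")
)

# Ownership counts per person, computed from Beneficial Owner rows only.
toy_owner <- toy_edges %>%
  filter(type == "Beneficial Owner") %>%
  count(target, name = "no_of_companies")

# Join on target alone, then replace missing counts with 0.
toy_joined <- toy_edges %>%
  left_join(toy_owner, by = "target") %>%
  mutate(no_of_companies = coalesce(no_of_companies, 0L))
```

In this toy table the contact edge for "Ann" inherits her ownership count, while "Bob", who owns nothing, gets 0.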
3.4.3 Exploring the nodes data frame
In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_nodes tibble data frame.
skim(mc3_nodes)
| Name | mc3_nodes |
| Number of rows | 27622 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 6 | 64 | 0 | 22929 | 0 |
| country | 0 | 1 | 2 | 15 | 0 | 100 | 0 |
| type | 0 | 1 | 7 | 16 | 0 | 3 | 0 |
| product_services | 0 | 1 | 4 | 1737 | 0 | 3244 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 21515 | 0.22 | 1822155 | 18184433 | 3652.23 | 7676.36 | 16210.68 | 48327.66 | 310612303 | ▇▁▁▁▁ |
In the code chunk below, datatable() of DT package is used to display mc3_nodes tibble data frame as an interactive table on the html document.
DT::datatable(mc3_nodes)
For the product_services column, entries recorded as "character(0)" are replaced with "unknown". For the revenue_omu column, missing (NA) or empty values are replaced with "0". Because the replacement is a character string, revenue_omu becomes a character column after this step.
#| echo: false
#| fig-width: 3
#| fig-height: 3
mc3_nodes <- mc3_nodes %>%
mutate(product_services = ifelse(product_services == "character(0)", "unknown", product_services),
         revenue_omu = ifelse(revenue_omu == "" | is.na(revenue_omu), "0", revenue_omu))
Distribution of the type of nodes
#| echo: false
#| fig-width: 3
#| fig-height: 3
hist_type_node <- ggplot(data = mc3_nodes,
aes(x = type)) +
geom_bar()+
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.1) + # after_stat() replaces the deprecated ..count.. notation
labs(title = "Distribution of Node Type", x = "Type", y = "Count") +
theme_bw() +
theme(plot.title = element_text(face = "bold"))
# hist_type_node
Distribution of number of countries for each id
#| echo: false
#| fig-width: 3
#| fig-height: 3
# Count the number of unique countries for each ID
country_counts <- mc3_nodes %>%
group_by(id) %>%
summarize(unique_countries = n_distinct(country))
# Tabulate how many ids have 1, 2, 3, ... distinct countries
frequency_table_country <- table(country_counts$unique_countries)
# Convert the frequency table to a data frame
frequency_df_country <- as.data.frame(frequency_table_country)
# Rename the columns
colnames(frequency_df_country) <- c("Unique Countries", "Frequency")
# Display the frequency table
frequency_df_country
  Unique Countries Frequency
1 1 22783
2 2 131
3 3 12
4 4 2
5 9 1
#| echo: false
#| fig-width: 3
#| fig-height: 3
# Plot the frequency table as a bar plot with labels
hist_country <- ggplot(frequency_df_country, aes(x = `Unique Countries`, y = Frequency)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_text(aes(label = Frequency), vjust = -0.5, size = 3.5) + # Add labels to the bars
labs(title = "Count of Countries for each ID",
x = "No of Countries",
y = "Count") +
theme_bw() +
theme(plot.title = element_text(face = "bold"))
# hist_country
From the above plot, we can see that 146 ids are associated with more than one country, which warrants suspicion.
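To see which ids lie behind that count, the per-id summary can be filtered and the countries listed side by side. This is a minimal sketch on made-up rows; toy_nodes and multi_country are hypothetical names, and the country values are only placeholders.

```r
library(dplyr)

# Hypothetical node rows: "Co A" and "Co C" appear under two countries each.
toy_nodes <- tibble(
  id      = c("Co A", "Co A", "Co B", "Co C", "Co C", "Co C"),
  country = c("ZH", "Oceanus", "ZH", "ZH", "ZH", "Marebak")
)

# Keep ids with more than one distinct country and list those countries.
multi_country <- toy_nodes %>%
  group_by(id) %>%
  summarise(unique_countries = n_distinct(country),
            countries = paste(sort(unique(country)), collapse = ", "),
            .groups = "drop") %>%
  filter(unique_countries > 1)
```

An id that appears under several countries may also be several unrelated entities that share a name, so such ids are candidates for review rather than confirmed anomalies.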
#| echo: false
#| fig-width: 3
#| fig-height: 3
# Count the number of unique rev for each ID
rev_counts <- mc3_nodes %>%
group_by(id) %>%
summarize(unique_rv = n_distinct(revenue_omu))
# Display the resulting data frame
#rev_counts
# Calculate the frequency count for each id
frequency_table_rev <- table(rev_counts$unique_rv)
# Convert the frequency table to a data frame
frequency_df_rev <- as.data.frame(frequency_table_rev)
# Rename the columns
colnames(frequency_df_rev) <- c("Unique rev", "Frequency")
# Display the frequency table
frequency_df_rev
  Unique rev Frequency
1 1 22238
2 2 591
3 3 76
4 4 14
5 5 4
6 6 2
7 7 2
8 10 1
9 11 1
#| echo: false
#| fig-width: 3
#| fig-height: 3
# Plot the frequency table as a bar plot with labels
hist_rev <- ggplot(frequency_df_rev, aes(x = `Unique rev`, y = Frequency)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_text(aes(label = Frequency), vjust = -0.5, size = 3.5) + # Add labels to the bars
labs(title = "Count of no of rev for each ID",
x = "No of rev",
y = "Count") +
theme_bw() +
theme(plot.title = element_text(face = "bold"))
# hist_rev
From the above, we can also see that 691 ids have more than one revenue value recorded.
The plots are combined using patchwork, as shown in the code chunk below.
#| echo: false
#| fig-width: 4
#| fig-height: 4
combine_plot_node <- hist_type_node / (hist_country + hist_rev)
combine_plot_node
Next, we create a new data frame for the nodes called mc3_nodes_updated, which stores the country and revenue counts derived earlier so that we can see which ids they belong to.
#| echo: false
#| fig-width: 3
#| fig-height: 3
# Join the unique_countries column from country_counts into mc3_nodes
mc3_nodes_updated <- mc3_nodes %>%
left_join(country_counts %>% select(id, unique_countries),
by = c("id" = "id"))
# Join the unique_rv column from rev_counts into mc3_nodes
mc3_nodes_updated <- mc3_nodes_updated %>%
left_join(rev_counts %>% select(id, unique_rv),
by = c("id" = "id"))
# View the updated mc3_nodes
mc3_nodes_updated
# A tibble: 27,622 × 7
id country type revenue_omu product_services unique_countries unique_rv
<chr> <chr> <chr> <chr> <chr> <int> <int>
1 Jones … ZH Comp… 310612303.… Automobiles 1 2
2 Colema… ZH Comp… 162734683.… Passenger cars,… 1 1
3 Aqua A… Oceanus Comp… 115004666.… Holding firm wh… 1 1
4 Makumb… Utopor… Comp… 90986412.5… Car service, ca… 1 1
5 Taylor… ZH Comp… 81466666.6… Fully electric … 1 1
6 Harmon… ZH Comp… 75070434.9… Discount superm… 1 1
7 Punjab… Riodel… Comp… 72167572.0… Beef, pork, chi… 1 1
8 Assam … Utopor… Comp… 72162317.2… Power and Gas s… 2 2
9 Ianira… Rio Is… Comp… 68832979.2… Light commercia… 1 1
10 Moran,… ZH Comp… 65592905.5… Automobiles, tr… 1 1
# ℹ 27,612 more rows
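The revenue_omu column is printed as <chr> in the output above. This follows from how ifelse() builds its result: when the replacement value is a character string, the whole result is coerced to character. The sketch below shows the effect on a made-up vector; revenue, as_text and as_number are hypothetical names used only for illustration.

```r
# Hypothetical revenue values with one missing entry.
revenue <- c(1500.5, NA, 320)

# A character replacement ("0") makes ifelse() return a character vector.
as_text <- ifelse(is.na(revenue), "0", revenue)

# A numeric replacement (0) keeps the result numeric.
as_number <- ifelse(is.na(revenue), 0, revenue)
```

Using a numeric 0 as the replacement would avoid converting the column back with as.numeric() later, at the cost of no longer distinguishing a missing revenue from a true zero.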
3.4.2 Initial Network Visualisation and Analysis
Building network model with tidygraph
filtered_mc3_edges_owner <- mc3_edges_with_no_of_companies %>%
filter(no_of_companies > 3, type == "Beneficial Owner")
filtered_mc3_edges_owner
# A tibble: 313 × 5
source target type weights no_of_companies
<chr> <chr> <chr> <int> <dbl>
1 Acevedo, Dickson and Gonzalez Richard Smith Bene… 1 6
2 Adams Group John Smith Bene… 1 9
3 Adams-Pope Michelle Rodr… Bene… 1 4
4 Adriatic Catch S.A. de C.V. David Jones Bene… 1 6
5 Albertine Rift NV Family Michael Taylor Bene… 1 4
6 Alexander PLC David Jones Bene… 1 6
7 Alvarez Ltd Michael Carter Bene… 1 5
8 Alvarez, Young and Ramos Michael Miller Bene… 1 5
9 Ancla del Este Ltd. Liability Co Aaron Jones Bene… 1 4
10 Ancla del Este Sp Fish John Jones Bene… 1 4
# ℹ 303 more rows
#| echo: false
#| fig-width: 3
#| fig-height: 3
# Create a data frame with source nodes and rename column
id1 <- filtered_mc3_edges_owner %>%
select(source) %>%
rename(id = source) %>%
mutate(type_node = "company")
# Create a data frame with target nodes and rename column
id2 <- filtered_mc3_edges_owner %>%
select(target, type) %>%
rename(id = target, type_node = type)
# Combine the two data frames and remove duplicates
mc3_nodes1 <- rbind(id1, id2) %>%
distinct()
# Note: check whether more node details need to be added
mc3_nodes1
# A tibble: 362 × 2
id type_node
<chr> <chr>
1 Acevedo, Dickson and Gonzalez company
2 Adams Group company
3 Adams-Pope company
4 Adriatic Catch S.A. de C.V. company
5 Albertine Rift NV Family company
6 Alexander PLC company
7 Alvarez Ltd company
8 Alvarez, Young and Ramos company
9 Ancla del Este Ltd. Liability Co company
10 Ancla del Este Sp Fish company
# ℹ 352 more rows
DT::datatable(mc3_nodes1)
mc3_graph <- tbl_graph(nodes = mc3_nodes1,
                       edges = filtered_mc3_edges_owner,
                       directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())
#| echo: false
#| fig-width: 4
#| fig-height: 4
# Set a seed for reproducibility
set.seed(123)
mc3_graph %>%
ggraph(layout = "fr") +
  # Constant alpha and colour are set outside aes() so they are not treated as data mappings
  geom_edge_link(alpha = 0.5) +
  geom_node_point(aes(size = betweenness_centrality),
                  colour = "lightblue",
                  alpha = 0.5) +
scale_size_continuous(range=c(1,10))+
theme_graph()
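Node size in the plot above is driven by betweenness centrality, which counts how many shortest paths between other nodes pass through a node. The sketch below builds a small made-up graph with tidygraph to show why an owner linked to several companies scores highly; toy_nodes, toy_edges and the node names are assumptions for illustration. The node column is called name and the edge columns from and to, so that the character endpoints match the node table under tidygraph's defaults.

```r
library(tidygraph)
library(dplyr)

# Hypothetical star: one owner linked to three companies.
toy_nodes <- tibble(name = c("Owner X", "Co A", "Co B", "Co C"))
toy_edges <- tibble(from = c("Co A", "Co B", "Co C"),
                    to   = c("Owner X", "Owner X", "Owner X"))

# Build an undirected graph and attach betweenness to each node.
toy_graph <- tbl_graph(nodes = toy_nodes, edges = toy_edges, directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness())

# Pull the node table back out to inspect the scores.
toy_scores <- toy_graph %>%
  activate(nodes) %>%
  as_tibble()
```

In this star, every shortest path between two companies runs through the owner, so the owner is the only node with a non-zero score.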
Preparing Network Data for visNetwork
Instead of plotting a static network graph, we can plot an interactive network graph using the visNetwork package. Before doing so, we need to prepare two tibble data frames, one for the nodes and one for the edges.
Preparing edges tibble data frame
edges_df <- mc3_graph %>%
activate(edges) %>%
  as_tibble() # as.tibble() is deprecated in favour of as_tibble()
edges_df
# A tibble: 313 × 5
from to type weights no_of_companies
<int> <int> <chr> <int> <dbl>
1 1 296 Beneficial Owner 1 6
2 2 297 Beneficial Owner 1 9
3 3 298 Beneficial Owner 1 4
4 4 299 Beneficial Owner 1 6
5 5 300 Beneficial Owner 1 4
6 6 299 Beneficial Owner 1 6
7 7 301 Beneficial Owner 1 5
8 8 302 Beneficial Owner 1 5
9 9 303 Beneficial Owner 1 4
10 10 304 Beneficial Owner 1 4
# ℹ 303 more rows
Preparing nodes tibble data frame
In this section, we will prepare a nodes tibble data frame by using the code chunk below.
nodes_df <- mc3_graph %>%
activate(nodes) %>%
  as_tibble() %>% # as.tibble() is deprecated in favour of as_tibble()
  rename(label = id) %>%
  mutate(id = row_number()) %>%
  relocate(id, .before = label)
nodes_df <- nodes_df %>%
  rename(group = type_node)
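visNetwork expects the nodes data frame to carry an id column and the edges data frame to carry from and to columns that refer to those ids, which is why the node names are moved to label and replaced by row numbers above. The sketch below shows the expected shape on made-up data and checks that every edge endpoint has a matching node; toy_nodes, toy_edges and endpoints_ok are hypothetical names.

```r
library(dplyr)
library(visNetwork)

# Hypothetical node table: integer ids, display labels and a group for styling.
toy_nodes <- tibble(
  id    = 1:3,
  label = c("Co A", "Co B", "Owner X"),
  group = c("company", "company", "Beneficial Owner")
)

# Hypothetical edge table: from/to refer to the ids above.
toy_edges <- tibble(from = c(1L, 2L), to = c(3L, 3L))

# Every edge endpoint should refer to an existing node id.
endpoints_ok <- all(c(toy_edges$from, toy_edges$to) %in% toy_nodes$id)

# Build the interactive widget from the two tables.
widget <- visNetwork(toy_nodes, toy_edges)
```

Running the same endpoint check on nodes_df and edges_df before plotting is a cheap way to catch edges that point at missing nodes.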
#| echo: false
#| fig-width: 4
#| fig-height: 4
# Plot the network graph with labeled nodes using visNetwork
visNetwork(nodes_df, edges_df, main = list(text = "Network Graph of Company and Beneficial Owner",
style = "color: black; font-weight: bold; text-align: center;")) %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visLayout(randomSeed = 123) %>%
addFontAwesome(name ="font-awesome") %>%
visGroups(groupname = "company", shape = "icon",
icon = list(code = "f0f7", color = "#000000")) %>%
visGroups(groupname = "Beneficial Owner", shape = "icon",
icon = list(code = "f2bd")) %>%
visLegend() %>%
  visOptions(
    highlightNearest = TRUE,
    nodesIdSelection = TRUE
  ) %>%
  visInteraction(
    zoomView = TRUE,
    dragNodes = TRUE,
    dragView = TRUE,
    navigationButtons = TRUE,
    selectable = TRUE, # Enable node selection
    hover = TRUE       # Enable hover effects
  )
Similarly, to plot the network graph of Company and Company Contacts, we repeat the steps above.
#Filter the type = "Company Contacts"
mc3_edges_cc<- mc3_edges_with_no_of_companies %>%
filter(no_of_companies > 3, type == "Company Contacts")
mc3_edges_cc
# A tibble: 72 × 5
source target type weights no_of_companies
<chr> <chr> <chr> <int> <dbl>
1 Adriatic Tuna GmbH & Co. KG Chris… Comp… 1 4
2 Alvarez and Sons Rober… Comp… 1 4
3 Andhra Pradesh Limited Liability Comp… Miche… Comp… 1 4
4 Austin-Porter Micha… Comp… 1 4
5 Bahía del Este Ges.m.b.H. Micha… Comp… 1 4
6 Baker-Savage Melis… Comp… 1 4
7 Brown-Frank John … Comp… 1 9
8 Caracola del Este Sagl Solutions Micha… Comp… 1 5
9 Clayton Ltd Brian… Comp… 1 5
10 Coleman, Harris and Mitchell John … Comp… 1 7
# ℹ 62 more rows
#| echo: false
#| fig-width: 3
#| fig-height: 3
# Create a data frame with source nodes and rename column
id3 <- mc3_edges_cc %>%
select(source) %>%
rename(id = source) %>%
mutate(type_node = "company")
# Create a data frame with target nodes and rename column
id4 <- mc3_edges_cc %>%
select(target, type) %>%
rename(id = target, type_node = type)
# Combine the two data frames and remove duplicates
mc3_nodes2 <- rbind(id3, id4) %>%
distinct()
# Note: check whether more node details need to be added
mc3_graph2 <- tbl_graph(nodes = mc3_nodes2,
                        edges = mc3_edges_cc,
                        directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())
#| echo: false
#| fig-width: 4
#| fig-height: 4
# Set a seed for reproducibility
set.seed(123)
mc3_graph2 %>%
ggraph(layout = "fr") +
  # Constant alpha and colour are set outside aes() so they are not treated as data mappings
  geom_edge_link(alpha = 0.5) +
  geom_node_point(aes(size = betweenness_centrality),
                  colour = "lightblue",
                  alpha = 0.5) +
scale_size_continuous(range=c(1,10))+
theme_graph()
edges_df_2 <- mc3_graph2 %>%
activate(edges) %>%
  as_tibble() # as.tibble() is deprecated in favour of as_tibble()
nodes_df_2 <- mc3_graph2 %>%
  activate(nodes) %>%
  as_tibble() %>%
  rename(label = id) %>%
  mutate(id = row_number()) %>%
  relocate(id, .before = label)
nodes_df_2 <- nodes_df_2 %>%
  rename(group = type_node)
#| echo: false
#| fig-width: 4
#| fig-height: 4
# Plot the network graph with labeled nodes using visNetwork
visNetwork(nodes_df_2, edges_df_2, main = list(text = "Network Graph of Company and Company Contacts",
style = "color: black; font-weight: bold; text-align: center;")) %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visLayout(randomSeed = 123) %>%
addFontAwesome(name ="font-awesome") %>%
visGroups(groupname = "company", shape = "icon",
icon = list(code = "f0f7", color = "#000000")) %>%
visGroups(groupname = "Company Contacts", shape = "icon",
icon = list(code = "f0c0")) %>%
  visOptions(
    highlightNearest = TRUE,
    nodesIdSelection = TRUE
  ) %>%
  visLegend() %>%
  visInteraction(
    zoomView = TRUE,
    dragNodes = TRUE,
    dragView = TRUE,
    navigationButtons = TRUE,
    selectable = TRUE, # Enable node selection
    hover = TRUE       # Enable hover effects
  )
Top 10% revenue
filtered_mc3_edges <- mc3_edges_with_no_of_companies %>%
  filter(no_of_companies > 3)
#| echo: false
#| fig-width: 3
#| fig-height: 3
# Create a data frame with source nodes and rename column
id4 <- filtered_mc3_edges %>%
select(source) %>%
rename(id = source) %>%
mutate(type_node = "company")
# Create a data frame with target nodes and rename column
id5 <- filtered_mc3_edges %>%
select(target, type) %>%
rename(id = target, type_node = type)
# Combine the two data frames and remove duplicates
mc3_nodes3 <- rbind(id4, id5) %>%
distinct() %>%
  left_join(mc3_nodes_updated,
            by = "id", # join key stated explicitly; id is the only shared column
            unmatched = "drop") %>%
distinct()
mc3_nodes3 <- mc3_nodes3 %>%
mutate(revenue_omu = ifelse(revenue_omu == "" | is.na(revenue_omu), "0", revenue_omu))
# Note: check whether more node details need to be added
mc3_nodes3
# A tibble: 535 × 8
id type_node country type revenue_omu product_services unique_countries
<chr> <chr> <chr> <chr> <chr> <chr> <int>
1 Aceved… company <NA> <NA> 0 <NA> NA
2 Adams … company ZH Comp… 9056.2418 A range of fish… 1
3 Adams … company ZH Bene… 0 unknown 1
4 Adams … company ZH Comp… 0 unknown 1
5 Adams-… company <NA> <NA> 0 <NA> NA
6 Adriat… company Puerto… Comp… 8869.44 Technical testi… 1
7 Adriat… company Oceanus Comp… 29366.6728 Integrated frei… 1
8 Albert… company Marebak Comp… 9760.8727 Alaska Pollock,… 1
9 Alexan… company ZH Bene… 0 unknown 1
10 Alvare… company ZH Bene… 0 unknown 1
# ℹ 525 more rows
# ℹ 1 more variable: unique_rv <int>
#| echo: false
#| fig-width: 3
#| fig-height: 3
# Convert the revenue column to numeric (if it's not already numeric)
mc3_nodes3$revenue_omu <- as.numeric(mc3_nodes3$revenue_omu)
# Calculate the revenue threshold for the top 10% (the 90th percentile), ignoring missing values
revenue_threshold <- quantile(mc3_nodes3$revenue_omu, probs = 0.90, na.rm = TRUE)
# Filter the DataFrame to retain only the rows with revenue above the threshold
filtered_mc3_nodes <- mc3_nodes3[mc3_nodes3$revenue_omu > revenue_threshold, ]
# View the filtered DataFrame
filtered_mc3_nodes
# A tibble: 54 × 8
id type_node country type revenue_omu product_services unique_countries
<chr> <chr> <chr> <chr> <dbl> <chr> <int>
1 Ancla … company Uzifri… Comp… 130212. Operation of fi… 1
2 Andhra… company Rio Is… Comp… 787121. Grocery products 1
3 Bahía … company Novarc… Comp… 60335. Fabricated meta… 1
4 Bahía … company Oceanus Comp… 254667. Swimwear and fa… 2
5 Bahía … company Novarc… Comp… 98065. Contract manufa… 3
6 Bahía … company Utopor… Comp… 67616. Gelatin 3
7 Baker … company ZH Comp… 104095830. Fish; fresh or … 1
8 BlueWa… company Zawali… Comp… 199596. Canned Products… 1
9 Bu yu … company Nalako… Comp… 62860. Gelatine produc… 1
10 Congo … company Riodel… Comp… 106161. Writing tools a… 1
# ℹ 44 more rows
# ℹ 1 more variable: unique_rv <int>
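The threshold step above takes the 90th percentile of revenue_omu and keeps the rows above it. Because missing revenues were set to 0 beforehand, those zeros take part in the percentile calculation and pull the threshold down. The sketch below compares the two choices on made-up numbers; revenue, observed and the threshold names are hypothetical, and computing the threshold on non-zero values is shown only as an alternative, not as the method used in this exercise.

```r
# Hypothetical revenues: ten imputed zeros plus ten observed values.
revenue <- c(rep(0, 10), seq(1000, 10000, by = 1000))

# Threshold with the imputed zeros included (as in the text).
threshold_all <- quantile(revenue, probs = 0.90, na.rm = TRUE)
top_all <- revenue[revenue > threshold_all]

# Threshold computed on observed (non-zero) revenues only.
observed <- revenue[revenue > 0]
threshold_observed <- quantile(observed, probs = 0.90)
top_observed <- observed[observed > threshold_observed]
```

With the zeros included, the threshold sits lower and more observed revenues pass it, so the "top 10%" refers to 10% of all rows rather than 10% of rows with a known revenue.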
#| echo: false
#| fig-width: 5
#| fig-height: 6
# Create a bar chart of revenue vs ID using ggplot
bar_plot_toprev <- ggplot(filtered_mc3_nodes, aes(x = reorder(id, revenue_omu), y = revenue_omu/1000)) +
geom_bar_interactive(aes(tooltip = paste("ID:", id,
"<br>Type:", type_node,
"<br>Country:", country,
"<br>Revenue:", revenue_omu,
"<br>Product Services:", product_services)),
stat = "identity", fill = "steelblue") +
labs(x = "id", y = "Revenue_omu ('000)", title = "Top 10% ids") +
coord_flip() +
theme(plot.title = element_text(face = "bold"))+
theme(axis.text.y = element_text(size = 6))
# Print the bar plot
girafe(ggobj = bar_plot_toprev,
width_svg = 8,
height_svg = 8*0.618)